Abstract: Personalizing image tags is a relatively new and
growing area of research, and in order to advance this research 
community, we must review and challenge the de-facto standard 
of defining tag importance. We believe that for greater progress 
to be made, we must go beyond tags that merely describe objects 
that are visually represented in the image, towards more
user-centric and subjective notions such as emotion, sentiment, and
preferences. 

We focus on the notion of user preferences and show that
the order in which users list tags on an image is correlated with
their order of preference over those tags. While this observation
is not completely surprising, to our
knowledge, we are the first to explore this aspect of user tagging 
behavior systematically and report empirical results to support 
this observation. We argue that this observation can be exploited 
to help advance the image tagging (and related) communities. 

Our contributions include: 1) conducting a user study
demonstrating this observation, and 2) collecting a dataset in
which user tag preferences are explicitly recorded.
I. Introduction

With the proliferation of cheap imaging devices (e.g.,
smartphone cameras, point-and-shoots, SLRs and DSLRs) and
content sharing websites (e.g., Flickr, Tumblr, Instagram),
the size of personal image collections has been growing
rapidly, making it infeasible for users to manually tag all
images in their collections. For example, on average, 130
million images are uploaded on Tumblr [1] and more than
90% of those images have no identifying text or tags [?].
This makes the task of automatic image annotation all the
more important, and the lack of semantic understanding
of images (the “semantic gap”) makes this task very difficult.

Much of the work in automatic image tagging has ignored
the user factor [?] by trying to find what we denote
as statistical correlations between the image content (visual
features) and objective semantics, regardless of the particular
users involved in the tagging activity. There has been some
work that focuses on user personalization in automated image
tagging, most notably [5], which we consider the state of
the art in this domain. Along the lines of object importance
as it relates to tags, the focus has mainly been on an
explicitly categorical definition of importance, measuring
properties of content/objects in the image (e.g., size, salience)
to estimate their relative importance [13]. These
content-property-based approaches to importance also tend to
ignore particular user effects and preferences, treating importance
as a purely global phenomenon [13], [14]. In our day and
age, where content is increasingly personalized and tailored to
user tastes, we believe that it is of paramount importance to
systematically understand user tagging behavior and trends.

A recent attempt at a design change on Flickr [15], and the
subsequent reversal of the change, demonstrates our second
assumption. The Flickr designers opted to update the site
to present user-generated tags in reverse-chronological order,
and active Flickr users immediately protested this change,
stating that the order in which they presented their tags was
intentional [16]. This led to an apology by the designers and a
return to the original chronological order. This event led us
to believe that, when providing tag lists, users are not merely
motivated by visually measurable properties such as saliency,
but more so by implicit biases and preferences, which are in
turn reflected in the order of the tag list.

In the subsequent sections, we investigate our aforementioned
hypotheses via a user study conducted using the Amazon
Mechanical Turk (AMT) system, and compare our findings to more
popular global notions of importance.

A. Related Works 

There are two ways to approach image tagging. The first is
explicit object tagging, where an image is tagged with a
particular word if the object the word represents is detected
in the image. The second is implicit tagging, where the
query image is compared to other similar images, and the
tags are “transferred” from the most similar images to the
query image via some scoring function. Many applications
of this implicit approach take their cue from the world of
collaborative filtering [10], [?].

The implicit approaches are usually more common than
their explicit counterparts because one does not have to learn
how to recognize or detect specific objects in the image, which
as noted earlier is not scalable. Moreover, not all concepts one
would like to use in describing an image are necessarily visual
(semantic gap) [18], [?]. Also, with the implicit approaches
one could imagine a latent space that more readily embeds
some sense of relatedness [?], while on the explicit end, it is
harder to extrapolate a measure of relatedness between objects
of different classes.

Fig. 1. Screenshots of our AMT experiment. The left image is from the initial tag-collection round, the middle from the second (tag-elimination)
round, and the right from the verification round.

With regard to personalization in image tagging, Rendle
et al. [20] assume that tags that are mentioned are preferred
to those that are not. This is similar to our assumption, but
they treat the tags that appear together on a tag list equally,
which our work suggests one should not do. Lipczak et al. [?]
also treat tags as essentially structureless entities (bag of
words), and other work on personalization similarly treats
user-provided tag lists as such [?]. To our knowledge, we are
the first to suggest that these user-provided lists should be
treated as having structure.

II. User Study 

In order to verify our assumption that users tend to place the
tags they prefer or consider most important at the start of tag
lists, we found it prudent to conduct a live user experiment
using the AMT system. Our main metric of interest is the
rank correlation between two orderings of the same tags: the
order obtained when we explicitly request and ascertain the
user's preference order over the tags they provided for an
image, and the order in which they listed those tags without
any prompt to rank their preferences. In the following, we
detail the setup of our user study, our metrics and measurements,
a comparison to other measures of (global) importance, and our
conclusions from the user study.

A. Study Setup 

We conducted our study on a subset of 500 images from the
NUS-WIDE dataset [22], which is a dataset of images from the
popular photoblogging website Flickr [?]. These images were
divided into 100 groups of 5 to create 100 Human Intelligence
Tasks (HITs); a HIT is the smallest indivisible unit of work
on AMT. Each HIT was then assigned to 15 different study
participants (turkers), totaling 1500 assignments.

For each HIT, our study was done in 2 stages, as shown
in Fig. 1. In the first stage, which we refer to as the Tag
Allocation Stage, we asked the turker to provide 5 tags for
each of the 5 images contained in the HIT. Tags are allocated
for all 5 images before the next stage of the experiment
begins. In the second stage, which we refer to as the
Preference Allocation Stage, we iteratively asked the turker
to eliminate their least preferred tag for each image from the
set of tags that had not yet been eliminated for that image in
previous iterations. For each round of elimination, we randomly
shuffled the order of the tags that were left from previous
rounds to protect turkers from presentation bias. Similarly,
the order of the images in each elimination round was randomly
shuffled to prevent presentation bias.

We also added a hidden verification test as part of the
preference allocation stage to ensure that the preferences which
the turker provided were consistent when asked a second time.
To that end, after reconstructing their preference order from
the 4 elimination rounds, for each image we asked the turker to
eliminate their least preferred tag among 2 randomly chosen
tags from those which they had provided for that image. If
their response matched their reconstructed preference order,
we considered the user's preference order for that image
verified; otherwise, we did not. Within each HIT we can
therefore tell which of the reconstructed preference lists are
reliable, and at the level of HITs we define the HIT reliability
as the number of verified preference orders within the HIT.
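
The reconstruction and verification steps described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the code used in the study: the function names (reconstruct_preference, is_verified, hit_reliability), the data layout, and the example tags are all hypothetical.

```python
from typing import List, Sequence


def reconstruct_preference(tags: Sequence[str], eliminated: Sequence[str]) -> List[str]:
    """Return the tags ordered from most to least preferred.

    `eliminated` holds the tag removed in each elimination round, in the
    order the rounds happened (least preferred tag first). With 5 tags and
    4 rounds, the single tag that is never eliminated is the most preferred.
    """
    if len(set(eliminated)) != len(eliminated) or not set(eliminated) <= set(tags):
        raise ValueError("eliminated tags must be distinct and drawn from tags")
    remaining = [t for t in tags if t not in eliminated]
    if len(remaining) != 1:
        raise ValueError("expected exactly one surviving tag")
    # The last tag eliminated is the second most preferred, and so on.
    return remaining + list(reversed(eliminated))


def is_verified(preference: Sequence[str], pair: Sequence[str], eliminated_tag: str) -> bool:
    """Check a verification answer against the reconstructed order.

    The answer is consistent if the tag eliminated from `pair` is the one
    ranked lower (less preferred) in `preference`.
    """
    a, b = pair
    expected = a if preference.index(a) > preference.index(b) else b
    return eliminated_tag == expected


def hit_reliability(verified_flags: Sequence[bool]) -> int:
    """Number of verified preference orders within one HIT (0 to 5)."""
    return sum(bool(v) for v in verified_flags)


# Hypothetical example: one image, 5 tags, 4 elimination rounds.
tags = ["beach", "sunset", "ocean", "sky", "vacation"]
eliminated = ["sky", "ocean", "beach", "vacation"]
preference = reconstruct_preference(tags, eliminated)
```

In this sketch the verification pair is passed in explicitly; in the study the 2 tags were chosen at random from the turker's own tags for the image.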

At the end of the experiment we are left with 7500 pairs
of tag lists from 391 turkers. Each pair consists of a
tag list in the default (initial) order and the same tags in the
user’s preferred order. For each pair we know whether
or not the preferred order is verified, and we use that as a
proxy for its reliability.

B. Metrics and Measurements 

To verify our assumption that users tend to present their
tag lists with an inherent preference order, as opposed to
an orderless set or bag-of-words, we examine the data
collected from our AMT user study. To measure whether the
data supports our claim, we employ 2 metrics: 1) Kendall’s
Tau rank correlation [23], and 2) Spearman’s Rho rank
correlation [24], which are both measures of how much two
rankings are correlated with one another. Both measures range
from -1 to 1, with -1 indicating perfect negative correlation,
0 indicating no correlation, and 1 indicating perfect positive
correlation.
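
For two orderings of the same small set of distinct tags, both coefficients can be computed directly from their standard no-ties definitions. The sketch below is illustrative only; the function names are hypothetical, and it assumes each list contains the same tags with no duplicates, as is the case for a 5-tag list and its reordering.

```python
from itertools import combinations
from typing import Dict, Sequence


def positions(order: Sequence[str]) -> Dict[str, int]:
    """Map each tag to its 1-based position in the list."""
    pos = {tag: i + 1 for i, tag in enumerate(order)}
    if len(pos) != len(order):
        raise ValueError("tags must be distinct")
    return pos


def kendall_tau(order_a: Sequence[str], order_b: Sequence[str]) -> float:
    """Kendall's tau (no ties): (concordant - discordant) / number of pairs."""
    pa, pb = positions(order_a), positions(order_b)
    if set(pa) != set(pb) or len(pa) < 2:
        raise ValueError("both orders must contain the same two or more tags")
    concordant = discordant = 0
    for x, y in combinations(pa, 2):
        # A pair is concordant if both lists rank x and y the same way round.
        if (pa[x] - pa[y]) * (pb[x] - pb[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(pa)
    return (concordant - discordant) / (n * (n - 1) / 2)


def spearman_rho(order_a: Sequence[str], order_b: Sequence[str]) -> float:
    """Spearman's rho (no ties): 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    pa, pb = positions(order_a), positions(order_b)
    if set(pa) != set(pb) or len(pa) < 2:
        raise ValueError("both orders must contain the same two or more tags")
    n = len(pa)
    d_squared = sum((pa[t] - pb[t]) ** 2 for t in pa)
    return 1 - 6 * d_squared / (n * (n * n - 1))


initial = ["beach", "sunset", "ocean", "sky", "vacation"]
same = kendall_tau(initial, initial), spearman_rho(initial, initial)
flipped = kendall_tau(initial, initial[::-1]), spearman_rho(initial, initial[::-1])
```

An identical ordering gives a value of 1 for both measures and a fully reversed ordering gives -1, matching the range described above.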

Fig. 2. This figure shows the confusion matrix between the preferred position of tags (groundtruth label) and the initial position of tags (predicted label) at
different levels of verification per HIT. The verification level is the number of images (out of 5) that were verified within the HIT.

TABLE I

| | Avg. τ corr | Var. τ corr | Avg. ρ corr | Var. ρ corr | min. verification/total | num verified/total |
| Per Assignment | 0.3089 | 0.062 | 0.3705 | 0.083 | 4/5 | 1114/1500 |
| Per User | 0.3046 | 0.047 | 0.3652 | 0.064 | 3.5/5 | 298/391 |
| Per Image | 0.2840 | 0.017 | 0.3400 | 0.021 | 11/15 | 434/500 |

This table shows the correlation statistics between the initial rank of tags and the preference ranks as provided by the user. We provide the averages per
assignment, per user (averaging over all images for the user), and per image (averaging over all users for that image).

We measure the average correlation per user, and the
average correlation per image. We also measure the effect
of the verification of the preference order on the correlation
scores, and report our final numbers based on data that has
been reasonably verified. We also present the confusion matrix
between the position “labels”; that is, treating each tag's
position as a label, we ask how well its position in the
initial order predicts its position in the preference order.
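
A confusion matrix over position labels can be tallied as in the sketch below. It is a minimal illustration, assuming each record is a pair (initial order, preferred order) over the same distinct tags; the function name position_confusion is hypothetical. Rows index the preferred position (groundtruth) and columns the initial position (predicted), following the Fig. 2 caption.

```python
from typing import List, Sequence, Tuple


def position_confusion(
    pairs: Sequence[Tuple[Sequence[str], Sequence[str]]], n_tags: int = 5
) -> List[List[int]]:
    """Count how often a tag at initial position j has preferred position i.

    matrix[i][j] is the number of tags whose preferred position is i and
    whose initial position is j (both 0-based).
    """
    matrix = [[0] * n_tags for _ in range(n_tags)]
    for initial, preferred in pairs:
        if len(initial) != n_tags or sorted(initial) != sorted(preferred):
            raise ValueError("each pair must order the same n_tags tags")
        for init_pos, tag in enumerate(initial):
            matrix[preferred.index(tag)][init_pos] += 1
    return matrix


# Hypothetical toy data with 3 tags per list.
toy_pairs = [
    (["a", "b", "c"], ["a", "b", "c"]),  # initial order equals preference
    (["a", "b", "c"], ["b", "a", "c"]),  # first two tags swapped
]
toy_matrix = position_confusion(toy_pairs, n_tags=3)
```

A matrix with most of its mass on or next to the diagonal corresponds to initial positions that match the preferred positions, or miss them by one place.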

As we can see from the confusion matrices in Fig. 2,
assuming that the verification level (here, the number of
images verified within a single HIT) is a proxy for the turker’s
attention to the task (and hence reliability), the position of a
tag on the initial list is a good predictor of the position of
the tag according to the turker’s preference. As the reliability
increases, the initial position is, more often than not, the same
as the preferred position, and any “mislabeling” is typically
within an error of 1 position.

In Fig. 3 we see that, for the reliable HITs, there seems
to be a moderately high correlation between the initial tag
list and the reconstructed preference order. This correlation
is statistically significantly different from zero, as validated
by a two-sided one-sample t-test with p-values less than 0.01.
We believe we are the first to show empirically the existence
of such a phenomenon, which a priori is not so obvious.

Each image was tagged by 15 different turkers, and most
of the images were, more often than not, tagged reliably by
the turkers. From Fig. 4 we can see that the aforementioned
correlation is largely independent of the image: even those
images with less reliable preference orders show a moderate
correlation between the reconstructed preference order and
the initial tag list, so it does not seem to be the case that the
visual content of the image is itself the cause of the
correlation. When we consider the average correlation per user,
as shown in Fig. 5, we observe a similar trend: users who
tagged images more reliably show on average a moderately
high correlation between the reconstructed preference order
and the initial tag list, and even the less reliable users still
exhibit a slight correlation. Our results are summarized in
Table I; they are statistically significant, as verified by a
one-sample t-test against zero correlation.
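
The test statistic for a one-sample t-test against zero correlation can be sketched as below. This is an illustration only: the function name and the example values are hypothetical and are not data from the study. The sketch stops at the t statistic and degrees of freedom; turning these into a p-value needs the t distribution, which a statistics library such as SciPy (scipy.stats.ttest_1samp) provides.

```python
import math
import statistics
from typing import Sequence, Tuple


def one_sample_t(values: Sequence[float], popmean: float = 0.0) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for H0: mean == popmean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("t statistic is undefined for zero variance")
    t_stat = (mean - popmean) / (sd / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical per-user correlation scores, for illustration only.
example_scores = [0.2, 0.4, 0.6, 0.8]
t_stat, dof = one_sample_t(example_scores)
```

A large positive t statistic relative to the t distribution with the given degrees of freedom indicates that the mean correlation is unlikely to be zero.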

C. Comparison to Global Importance 

In much of image tagging research [13], [14], [25],
[26], tag importance is usually considered in terms of what is
visually represented in the image, typically by saliency. To
that end, many researchers use tag frequency as a proxy for tag
importance and saliency [?], [25], and for nearest-neighbor
approaches to tagging, predicting tags based on the most
frequent ones has had relative success in terms of tag
recall [5].

In this section, we compare the reconstructed user preference
order to the frequency order obtained from the number
of times each tag was mentioned by turkers for that image.
In Table II we report the correlation statistics. As we can
see, there is a slight correlation between the preference and
the frequency, but it is not strong, which suggests that
although users might mention tags of global importance (or
salient tags) in their tag lists, those tags are usually not their
most preferred.

TABLE II

| | Avg. τ corr | Var. τ corr | Avg. ρ corr | Var. ρ corr |
| Overall | 0.187417923625 | 0.177135465861 | 0.220259416265 | 0.232534427934 |
| Image (avg. over users) | 0.186084678459 | 0.0230843379766 | 0.218679398053 | 0.0306837413827 |

This table shows the correlation statistics between the frequency rank of tags and the preference ranks as provided by the user. The frequency rank of a tag
for an image is derived from the number of times it was mentioned by all the turkers that tagged the given image. We provide the correlation statistics over
all the tag lists, and also averaged across the users for each image. We only report the statistics for images that were verified; using all the images results in
even lower correlation.

Fig. 3. This figure shows the average correlation (and error bars) between the initial tag list and the reconstructed preference order with respect to the level
of reliability. The Kendall’s Tau correlation is shown on the left, and Spearman’s Rho on the right.

Fig. 4. This figure shows the average correlation per image (and error bars) between the initial tag list and the reconstructed preference order with respect
to the number of times the image has been reliably tagged. The Kendall’s Tau correlation is shown on the left, and Spearman’s Rho on the right.

In order to verify that suggestion, we also report, in
Table III, the average position of the most frequently mentioned
tag for an image on the reconstructed preference-ordered list
for the same image. We notice that, more often than not, the
most frequent tag appears later in the preference order.
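
The frequency order and the position of the most frequent tag can be computed as in the sketch below. This is a hedged illustration, not the study's code: the function names are hypothetical, tags are assumed to be already normalized strings, and ties in frequency are broken alphabetically, which is an assumption of this sketch rather than something the study specifies.

```python
from collections import Counter
from typing import List, Sequence


def frequency_order(tag_lists: Sequence[Sequence[str]]) -> List[str]:
    """Order the tags of one image by how many turkers mentioned them.

    Each inner list is one turker's tags for the image. Ties are broken
    alphabetically (an assumption made for this sketch).
    """
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    return sorted(counts, key=lambda t: (-counts[t], t))


def most_frequent_tag_positions(preference_lists: Sequence[Sequence[str]]) -> List[int]:
    """1-based position of the image's most frequent tag in each
    preference-ordered list that contains it."""
    top = frequency_order(preference_lists)[0]
    return [prefs.index(top) + 1 for prefs in preference_lists if top in prefs]


# Hypothetical preference-ordered lists from three turkers for one image.
image_prefs = [
    ["sunset", "beach", "sky"],
    ["beach", "sky", "ocean"],
    ["ocean", "sunset", "beach"],
]
top_positions = most_frequent_tag_positions(image_prefs)
average_position = sum(top_positions) / len(top_positions)
```

Averaging these positions over all verified images would give a statistic analogous to the one reported in Table III.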

D. Study Summary 

From our study we arrive at the following conclusions: 1)
The order that users provide in their tag list for an image is
moderately correlated with their inherent preferences over those
tags. 2) This preferred order is not as simple as the order
of objects in the image from most salient to least salient,
nor is it the same as other global notions of preference. 3)
Hence, in understanding user tagging behavior and inferring
user preferences, one should consider the order in which users
present their tags for images.

Fig. 5. This figure shows the average correlation per user (and error bars) between the initial tag list and the reconstructed preference order with respect to the level
of reliability. The Kendall’s Tau correlation is shown on the left, and Spearman’s Rho on the right.

We believe that this study will help further the development
of research in the area of image tagging, and that the
observations provided by this study could be used to improve upon
current state-of-the-art methods for image tagging, especially
with respect to personalization.

TABLE III

| | Average | Variance |
| Position of most frequent tag | 3.7652 | 0.356735173152 |

This table shows the average position of the most frequently mentioned tag
for an image, and its variance, in the preference list given by the users. As
we can see, out of the 5 tags given by the user, the most frequently mentioned
ones tend to be closer to the bottom of the list, i.e., less preferred. We only
report the statistics for images that were verified.

III. Conclusion and Future Work 

In this work we proposed a new measurement of tag
preferences, and demonstrated that there is indeed a tag-order
bias; that is, when a user mentions tag a before tag b in a
list of tags for a given image, the user is implying that he/she
prefers a, or considers a to be of greater importance than b.
This leads us to conclude that although there are many visual
factors that may affect what tags a user will provide for an
image, it is more useful to characterize instead (or rather in
conjunction with these factors) the users’ tagging habits to
learn what tags are of more importance to the users, whether
visually motivated or not. Automatic tagging systems should
employ this technique to improve their overall performance.

It is also important to note that this study was not tied
to any particular online tagging system, like Flickr, and as
such we believe that the findings in this study are independent
of the online platform, as opposed to being an artifact of
the user interface. Hence, the findings should hold on most
online tagging systems, or at least on image tagging systems
that allow for user input via text. One direct way we believe
this preference information can be exploited is the following:
given a user’s tagging history, if tags a and b frequently occur
on the same tag lists for images, and tag a is mentioned before
b more often than the reverse, then in predicting a tag list for
a new image for that user, this order should be enforced, as it
reflects a preference for a over b for that user.
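
The pairwise ordering idea described above can be sketched as follows. This is only an illustrative sketch of one possible realization, not a method evaluated in this work; the function names and the toy tagging history are hypothetical, and each tag list is assumed to contain no duplicate tags.

```python
from collections import Counter
from itertools import combinations
from typing import Optional, Sequence, Tuple


def precedence_counts(history: Sequence[Sequence[str]]) -> Counter:
    """Count, over a user's past tag lists, how often one tag precedes another.

    counts[(a, b)] is the number of lists in which a appears before b.
    """
    counts: Counter = Counter()
    for tags in history:
        # combinations() preserves list order, so `earlier` precedes `later`.
        for earlier, later in combinations(tags, 2):
            counts[(earlier, later)] += 1
    return counts


def preferred_order(counts: Counter, a: str, b: str) -> Optional[Tuple[str, str]]:
    """Return the pair in its more frequent order, or None if there is no majority."""
    ab, ba = counts[(a, b)], counts[(b, a)]
    if ab > ba:
        return (a, b)
    if ba > ab:
        return (b, a)
    return None


# Hypothetical tagging history for a single user.
history = [["dog", "park", "sun"], ["dog", "sun"], ["park", "dog"]]
counts = precedence_counts(history)
```

A tag predictor could then reorder its output for this user so that, for example, "dog" is listed before "sun", while leaving pairs with no clear majority in the predictor's own order.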

Another future direction, assuming we can embed the tags
into some metric space, would be to learn a function that
takes as input a pair of features (each representing a tag)
and returns a prediction of the pair's preference order and
strength. This would enable us to “transfer”
preferences between tags that are similar (or closely related)
even though we might never have observed them together
for a particular user. We would also like to analyze what
kinds/categories of tags are preferred over others under this
framework, and answer the question of whether these categorical
relationships depend on the user (i.e., do the users cluster in a
way such that the different clusters exhibit different categorical
relationships?). For example, do some users tend to tag images
in a bottom-up fashion with respect to ontologies, and other
users in a top-down fashion?